

# **Instruction-Level Parallel Processing**

Joseph A. Fisher, B. Ramakrishna Rau Software and Systems Laboratory HPL-92-02 January, 1992

instruction-level parallelism, VLIW processors, Very Long Instruction Word processors. dataflow processors. superscalar processors, instruction scheduling. trace scheduling. software pipelining. modulo scheduling. pipelining, superpipeling, multiple operation issue, speculative execution, predicated execution The performance of microprocessors has increased steadily over the past twenty years at a rate of about This is the cumulative result of 50% per year. architectural improvements as well as increases in circuit speed. Moreover, this improvement has been obtained in a transparent fashion, that is, without requiring programmers to rethink their algorithms and programs, thereby enabling the tremendous proliferation of computers that we see today. continue this performance growth, microprocessor designers have incorporated Instruction-level Parallelism (ILP) into new designs. ILP utilizes the parallel execution of the lowest level computer operations - adds, multiplies, loads, and so on - to increase performance transparently. The use of LP promises to make possible, within the next few years, microprocessors whose performance is many times that of a CRAY-1S. This report provides an overview of ILP, with an emphasis on ILP architectures superscalar, VLIW and dataflow processors - and the compiler techniques necessary to make ILP work well.

\*\*

## 1 Introduction

Much of the dramatic increase in the performance of microprocessors over the past twenty years has come from continuing improvements in processor architecture. Indeed, while overall performance has grown steadily at the compounded rate of about 50% per year, raw circuit speed has accounted for less than half of that annual increase [1]. The rest has been due to such factors as an increase in word size from 8 bits to 32 bits, a move from complex instruction sets (CISC) to reduced instruction sets (RISC), the inclusion of caches as well as floating point hardware on-chip, and, most recently, the incorporation of features, such as pipelining that previously were only used in mainframes and supercomputers. These features would not be possible without an increase in chip density: the number of transistors on a microprocessor chip has increased at about 40% per year, and this trend shows no sign of abating.

An important feature of these architectural improvements is that, like circuit speed improvements, they are largely transparent to users, who do not have to change their way of working to derive performance benefits. Continuing this remarkable trend has presented a challenge to processor designers, and their response has spawned important new technologies, chief among them Instruction-level Parallelism (ILP). ILP has as its objective the execution in parallel of the lowest level machine operations, such as memory loads and stores, integer additions and floating point multiplications. These are the normal RISC-style operations that are executed when a program runs, but by using ILP multiple operations are executed in parallel, even though the system is handed a single program written with a sequential processor in mind.

Traditional Parallel Processing. Traditional parallel processing derives its speed improvements in a fundamentally different way from ILP: large sections of code are run in parallel on independent processors. There is hardly a development more exciting than parallel processing: it has the potential to change fundamentally what can be done with a computer. Some problems that would be unthinkable on an ordinary computer can be solved using massively parallel processing. But, for algorithmic reasons, many applications probably never will be subject to speed improvement through parallel processing. And when it is possible, major investments are often required-usually algorithms and code must be thoroughly rethought and reworked to be effective. There are billions of dollars worth of "dusty deck" uniprocessor oriented software, ranging from engineering and scientific simulations to commercial processing, running on the uniprocessors that constitute virtually all of the computers in use today. These programs will not soon be running on parallel processors; those that ultimately do are likely to be running on systems in which each parallel node is itself an ILP processor. Thus traditional parallel processing and instruction-level parallel processing complement each other--only infrequently are they viewed as alternatives in seeking the same end.

Instruction-level Parallelism. ILP was the first form of parallelism used to make computations go faster: thirty years ago the Control Data 6600 [2] and the experimental IBM-Stretch [3] were able to overlap functional unit operation and do some amount of dynamic code rearrangement. IBM followed the Stretch with the System/360 Model 91 [4]. They all displayed most of the characteristics of what are today called superscalar machines (a form of ILP discussed below). The CDC machine and its successors were commercially quite successful, and led to the design of the scalar unit of today's CRAY supercomputers.

In the 1970's a new style of ILP, called VLIW (for Very Long Instruction Word, another form of ILP processor discussed below) emerged. VLIW-style machines were quite successful, though not as general purpose processors. Several companies had been marketing specialized signal processing computers utilizing horizontal microcode. When fast writeable memory became relatively inexpensive, some of these manufacturers, most notably Floating Point Systems [5], replaced the read-only control store with writeable memory, giving users access to instruction-level parallelism in far greater amounts than the superscalar processors had. These machines were extremely fast, the fastest processors by far in their price ranges, for important classes of scientific applications. However, despite attempts on the part of several manufacturers to market their products for more general, everyday use, they were almost always restricted to a narrow class of applications. This was caused by the lack of good system software, which in turn was caused by the idiosyncratic architecture of processors built for a single application, and by the lack at that time of good algorithms for code generation for ILP machines with that much parallelism.

In the 1980's, three computer startups, Culler, Multiflow and Cydrome, built general-purpose VLIWs with varying degrees of parallelism. As a group, these companies were able to demonstrate that it was possible to build practical machines which achieved large amounts of ILP on scientific and engineering codes. Although, for various reasons, none was a lasting business success, several major computer manufacturers acquired access to the technologies developed at these startups, and there are several active VLIW design efforts underway.

In the 1990's, microprocessors have begun to offer more silicon space than a single RISC processor requires. Virtually every major microprocessor developer has a superscalar or VLIW system announced or under development, some of which are already being marketed [6, 7, 8]. These architectural activities promise to continue the remarkable twenty year record of microprocessor performance improvement that has sustained the computer industry and made possible the birth of new scientific disciplines such as computational physics, computational chemistry and computational cosmology.

The Challenge of ILP. While the potential of ILP is exciting, many important questions face the designer of an ILP system:

- o what gives rise to instruction-level parallelism in conventional, sequential programs and how much of it is there?
- o how is the potential parallelism identified and enhanced?
- o what must be done in order to exploit the parallelism that has been identified?
- o how should the work of identifying, enhancing, and exploiting the parallelism be divided between the hardware and the compiler?
- o what are the alternatives in selecting the architecture of an ILP processor?

In this report, with the help of a hypothetical ILP processor running a sample program, we will examine increasingly sophisticated strategies for exposing and utilizing instruction-level parallelism.



Figure 1: Execution hardware of a sample processor for executing programs with instruction-level parallelism. (A) Datapaths of the processor. Only those functional units necessary for the program example are shown. These include two identical integer units and two identical memory ports. INT1 and INT2 perform integer addition, subtraction and multiplication. MEM1 and MEM2 represent the hardware that prepares and presents load and store operations to the dual-ported data cache, BRANCH performs conditional and unconditional branches. It is able to read a pair of registers, compare them and branch on the boolean result. It is also able to store the boolean result of a comparison into the register file for use as a predicate. Every operation has a third input, a boolean value, shown as the thin arrow into each functional unit. If this boolean value is true, the operation executes normally. If not, it is not executed at all. (B) Latencies of the various operations. Only those operations necessary for the program example are listed. Integer addition and subtraction, memory stores and branches complete in a single cycle. Integer multiplication and memory loads take three and two cycles, respectively, to complete, but they are pipelined; a new operation can be started every cycle on a functional unit even if the previous ones have not yet completed. INT1 and INT2, both of which perform operations having unequal latencies, have the restriction that two operations on the same functional unit may not finish at the same time since there is only a single result bus per functional unit to communicate the result back to the register file. MEM1 and MEM2 are not subject to this restriction since store operations do not return a value to the register file.

## 2 Parallel Instruction Execution

ILP Hardware. Consider a model of hardware consisting of four functional units and a branch unit connected to a common register file (Fig. 1A). We will use this model in all three of the hypothetical processors that we will discuss below. This processor is capable of starting two integer operations, two memory operations and a branch or compare operation every cycle (though on a real machine one might also expect functional units for floating point and address computation). In addition to the input operands that one would normally expect for an operation, each operation has an additional single bit input called the predicate. If the predicate input is true, the operation executes normally. If false, the operation does nothing and the result register is not updated.

All of these operations get their input operands, including the predicate, from registers. With this hardware, the fact that five predicated operations can be started every cycle implies that up to fifteen operands need to be read from the register file and up to five results need to be stored into the register file every cycle. A register file of this kind, which is capable of supporting multiple reads and writes each cycle, is referred to as a multiported register file. (A real machine with this many functional units would be likely instead to have several, smaller, multiported register banks.)

Certain operations, such as integer addition and subtraction, complete in a single cycle (Fig. 1B). Other operations, such as integer multiplication and memory loads, take multiple cycles to complete. Nevertheless, much as in a manufacturing assembly line, this processor can start a new operation on each functional unit before the previous ones have completed, using a technique called **pipelining**. With pipelining it is possible to have multiple operations, in various stages of completion, simultaneously executing on a single functional unit. In fact, if a program were causing the longest latency operation to be issued on every unit in each cycle, this machine would be capable of starting five operations per cycle and of having eleven operations "in flight" at all times.

There is thus a fair amount of parallelism available even in this simple processor. The challenge is to make good use of it. The first question that comes to mind is whether enough ILP exists in programs to make this possible. Then, if this is so, what must the compiler and hardware do to successfully exploit it?

A Sample Program Fragment. We shall use a simple program fragment to illustrate how instruction-level parallelism can be found, enhanced and exploited (Fig. 2A). A basic block is the largest contiguous set of statements with no branches into or out of the middle of that set. The program fragment includes two loops. The inner loop is the one around basic block D. The outer loop is around the entire code fragment. On each iteration of the outer loop, the program may either execute the inner loop or it may execute basic block B. This behavior is shown pictorially in a control flow graph (Fig. 2B).

Many techniques for optimizing and parallelizing programs are dependent upon knowing which portions of the code are executed more frequently than others. A useful statistic is the frequency of transitions between each pair of basic blocks, from which one can also calculate the execution frequency for each basic block. In practice, these statistics are gathered by a profiling tool while the program runs and are saved for later use.



Figure 2: (A) A fragment of a sample program. Each statement corresponds to a single operation on the example processor in Figure 1. (Address computation is ignored for simplicity.) The first column is the label for the corresponding statement in the third column. The second column lists the labels for the basic blocks. All statements starting with the one on the same line as a basic block label up to the statement just prior to the next label are in the same basic block. (B) The control flow graph for the sample code. Each block corresponds to a single basic block. The arcs indicate the order in which flow of control can move from one basic block to another. The existence of two arcs out of a block is the result of having a conditional branch in that basic block. The number on an arc indicates the frequency of that arc, that is, the number of times that that arc was found to have been traversed during an execution of the program. The frequency of a certain basic block is the sum of the frequencies of all incoming (or outgoing) arcs.

A serial execution of a program is one in which each statement is completed before the next one is begun. The time a program takes to execute serially can be calculated by adding the latencies of all operations in a basic block, multiplying that by the number of times the block is executed, and summing over all blocks. Serial execution of our example results in 18320 operations being executed in 29680 cycles for an average execution rate of 0.62 operations per cycle (Figs. 3 A,B).

| Dania          | North and of            | F         | O          | B     | T                | T-        | 337 1 1 1 |
|----------------|-------------------------|-----------|------------|-------|------------------|-----------|-----------|
| Basic<br>Block | Number of<br>Operations | Frequency | Operations | Basic | Execution        | Frequency | Weighted  |
| DIOCK          | Operations              |           | Executed   | Block | Time             |           | Time      |
| A              | 5                       | 1000      | 5000       | A     | 9                | 1000      | 9000      |
| С              | 2                       | 120       | 240        | Ċ     | 2                | 120       | 240       |
| D              | 6                       | 1200      | 7200       | Ď     | 9                | 1200      | 10800     |
| E              | 1                       | 120       | 120        | E     | 1                | 120       | 120       |
| В              | 2                       | 880       | 1760       | В     | 4                | 880       | 3520      |
| F              | 4                       | 1000      | 4000       | F     | 6                | 1000      | 6000      |
| Total          |                         |           | 18320      | Total | _                |           | 29680     |
|                |                         |           |            |       | ge Operations P  | er Cycle  | 0.62      |
|                |                         | A         |            |       |                  | В         |           |
| Basic          | Execution               | Frequency | Weighted   | Basic | Execution        | Frequency | Weighted  |
| Block          | Time                    | Troqueizy | Time       | Block | Time             | Troquency | Time      |
| Dioon          | 2                       |           | 11110      | Diook | IIIIO            |           | 710       |
| A              | 7                       | 1000      | 7000       | A     | 7                | 1000      | 7000      |
| С              | 1                       | 120       | 120        | С     | 5                | 120       | 600       |
| D              | 6                       | 1200      | 7200       | D     | 2                | 960       | 1920      |
| E              | 1                       | 120       | 120        | E     | 4                | 120       | 480       |
| В              | 4                       | 880       | 3520       | В     | 4                | 880       | 3520      |
| F              | 4                       | 1000      | 4000       | F     | 4                | 1000      | 4000      |
| Total          |                         |           | 21960      | Total |                  |           | 17520     |
| Avera          | age Operations P        | er Cycle  | 0.83       |       | ge Operations F  | Per Cycle | 1.05      |
|                | •                       | C         |            | D     |                  |           |           |
| Basic          | Execution               | Frequency | Weighted   | Basic | Execution        | Frequency | Weighted  |
| Block          | Time                    |           | Time       | Block | Time             |           | Time      |
|                |                         |           |            |       |                  |           |           |
| 1A             | 8                       | 500       | 3500       | H     | 8                | 1         | 8         |
| 1C             | 5<br>2                  | 60        | 300        | C     | 5                | 120       | 600       |
| 1D             |                         | 480       | 960        | D     | 2                | 960       | 1920      |
| 1E             | 4                       | 60        | 240        | E     | 4                | 120       | 480       |
| 1F             | 1                       | 500       | 500        | ABF   | 4                | 998       | 3992      |
| 2A             | 1                       | 500       | 500        | G     | 8                | 1         | 8         |
| 2C             | 5                       | 60        | 300        | Total |                  |           | 7008      |
| 2D             | 2                       | 480       | 960        | Avera | ige Operations I | Per Cycle | 2.61      |
| 2E             | 4                       | 60        | 240        | 1     | - <b>-</b>       | -         |           |
| 2F             | 1                       | 500       | 1500       |       |                  |           |           |
| Total          |                         |           | 8500       |       |                  |           |           |
| Avera          | age Operations F        | Per Cycle | 2.16       |       |                  |           |           |
|                |                         | E         |            |       |                  | F         |           |
| Eigen 2        | t. (A) The numb         |           |            | ·     | C - 1            | (D, E) (  |           |

Figure 3: (A) The number of operations executed in the serial execution of the program. (B-F) Execution time (per basic block and total) for various schemes that are discussed over the course of this article. The time to execute the sample code can be calculated by first computing, for each basic block, the execution time for a single visit to that basic block, next, weighting the execution time for each basic block by its execution frequency and, finally, adding up the weighted execution times. In the case of serial execution, the execution time for each basic block is computed by adding up the latencies of all the operations in that basic block. (B) Serial execution. (C) Execution exploiting only the ILP within individual basic blocks. (D) Execution exploiting only the ILP within individual basic blocks except for loop D which is software pipelined. (E) Outer loop unrolled once, basic blocks 1A, 1B, 1F, 2A, 2B, 2F trace scheduled, and both copies of loop D software pipelined. (F) Outer loop trace through A, B, F software pipelined and loop D software pipelined.

| TIME | INT1       | <u>INT2</u> | MEM1      | MEM2       | BRANCH              |
|------|------------|-------------|-----------|------------|---------------------|
|      |            |             |           |            |                     |
| 1    | _          | _           | A1        | <b>A</b> 2 | -                   |
| 2    |            | _           | _         | _          | -                   |
| 3    | <b>A</b> 3 | _           | _         | _          | -                   |
| 4    | -          | _           | _         | _          | _                   |
| 5    | _          | _           | _         | _          | _                   |
| 6    | A4         | _           | _         | -          |                     |
| 7    | _          | _           | _         | _          | <b>A</b> 5          |
| , ·  |            |             | •         |            |                     |
|      |            |             | A         |            |                     |
| TIME | INT1       | INT2        | MEM1      | MEM2       | BRANCH              |
|      |            |             |           |            |                     |
| 1    | B1         | -           | -         | -          | -                   |
| 2    | B2         | _           | -         | _          | -                   |
| 3    | <b>-</b>   | -           | _         | -          | -                   |
| 4    | _          | -           | _         | _          | -                   |
|      |            |             | В         |            |                     |
|      |            |             |           |            |                     |
| TIME | INT1       | INT2        | MEM1      | MEM2       | BRANCH              |
|      |            |             |           |            |                     |
| 1    | C1         | C2          |           | _          | -                   |
| 1    |            |             | C         |            |                     |
| TIME | INT1       | INT2        | MEM1      | MEM2       | BRANCH              |
| TIME | 11111      | INIZ        | DIENGIT   | HENZ.      | <u> BRANCH</u>      |
| 1 .  | 70         | 25          | <b>D1</b> |            |                     |
| 1    | D2         | D5          | D1        | -          | -                   |
| 2    |            | _           |           | _          | -                   |
| 3    | D3         | _           | -         | -          | -                   |
| 4    | -          | -           | _         | -          | -                   |
| 5    |            | _           | _         | -          | <u>-</u>            |
| 6    | D4         | _           | _         | -          | D6                  |
|      |            |             | D         |            |                     |
| TIME | INT1       | INT2        | MEM1      | MEM2       | BRANCH              |
|      |            |             |           |            |                     |
| 1    | _          | _           | _         | _          | E1                  |
| 1 -  | _          |             | _         |            | <b>11</b> 1.        |
|      |            |             | E         |            |                     |
| TIME | INT1       | INT2        | MEM1      | MEM2       | BRANCH              |
|      |            |             |           |            |                     |
| 1    | F1         | F3          | -         | -          | -                   |
| 2    | -          | -           | -         | -          | -                   |
| 3    | -          | -           | -         | -          | -                   |
| 4    | F2         | -           | -         | -          | F4                  |
|      |            |             | F         |            |                     |
|      |            |             |           |            | ing she II D mishin |

Figure 4: Record of execution for each basic block when exploiting the ILP within individual basic blocks. (A) Basic block A. (B) Basic block B. (C) Basic block C. (D) Basic block D. (E) Basic block E. (F) Basic block F.

Basic Block ILP. A program can run faster if we use the instruction-level parallelism that exists between instructions within a basic block. Instead of starting an instruction only after the previous instruction has finished, one could start it as soon as the instructions that generate its input operands have finished and sufficient resources, such as functional units and registers, exist to execute it. This execution strategy yields a record of execution for most basic blocks that is slightly more parallel than before (Fig. 4). It results in a slight reduction in the execution times for basic blocks A, C, D and F, dropping the total execution to 21960 cycles for an average of 0.83 operations per cycle (Fig. 3C) and with as many as 3 operations being issued at one time.

Parallelism within each basic block is limited by the data dependences between instructions, which results in serial execution. This is compounded by the execution latency of operations, which lengthens the time to execute the serial chain of instructions. For instance, in basic block D (Fig. 5A), statement D3 uses T6 which is the result of D1. So, D3 is dependent on D1. Likewise, D4 is dependent on D3 through its use of T7. On the other hand, D5 is not dependent on D1, D2, D3 or D4; independences like this can lead to a small amount of parallelism within a single basic block. The dependences between operations are often represented as a graph (Fig. 5B) with the nodes representing the statements and the arcs representing the dependences between statements. The longest path through a dependence graph is called the critical path. Consider the dependence graph for basic block D. Taking into account the latencies of the operations, the critical path for D, through D1, D3 and D4 (which are thus called critical path operations), is 6 cycles long. 6 cycles is thus the shortest time in which basic block D can be executed.



Figure 5: (A) The code for basic block D. Parallelism within a basic block is limited by the data dependences between instructions and by the execution latency of operations. For instance, operation D3 uses T6 which is the result of D1. So, D3 is dependent on D1 and cannot begin until D1 has finished. Likewise, D4 is dependent on D3 through its use of T7. On the other hand, D5 is not dependent on D1, D2, D3 or D4. Such independence leads to a small amount of parallelism within individual basic blocks. (B) The dependence graph for basic block D. The ILP within the basic block is represented by a graph with the nodes representing the operations and the arcs representing the dependences between operations. The input to D1, labelled "@X(I-J)", is the address of the array element X(I-J).

Global ILP. Exploiting the parallelism within individual basic blocks led only to a 30% speedup (Fig. 3C); we could expect more if the computations in separate basic blocks could also be executed in parallel with one another. The problem is that basic blocks are separated by branches, most often conditional branches. Frequently, the branch condition can only be calculated at the end of the basic block and only then is it possible to determine the identity of the basic block that is to be executed next. This would appear to frustrate any attempts to execute basic blocks in parallel.

One approach is to recognize that although we may not yet know whether the next basic block is to be executed, we could be certain that a subsequent one must be. In our example, while executing basic block A, even though branch A5 may not have been executed, it is certain that basic block F must be executed since all paths from A lead to it. In addition, we could test the branch condition in F4 immediately to determine whether another iteration is to be executed. If this turns out to be the case, we now know that basic blocks A and F from the next iteration, too, must definitely be executed. This already allows us to execute the operations from four basic blocks in parallel (subject to the dependences between them) without yet knowing which way the branch A5 in the first iteration is going. This type of parallelism we shall term control parallelism.

| TIME | INT1 | INT2 | MEM1 | MEM2 | BRANCH |
|------|------|------|------|------|--------|
| 1    | 1F3  | 1F1  | 1A1  | 1A2  | _      |
| 2    | 2F3  | -    | 2A1  | 2A2  | -      |
| 3    | 1B1  | 1A3  | 3A1  | 3A2  | -      |
| 4    | 3F3  | 1B2  | _    | -    | -      |
| 5    | 2B1  | 2A3  | 4A1  | 4A2  | -      |
| 6    | 3B1  | 3A3  | -    | -    | -      |
| 7    | 4F3  | 2B2  | _    | _    | -      |
| 8    | 1A4  | 3B2  | 5A1  | 5A2  | _      |
| 9    | 1F2  | 4A3  | _    | -    | 1A5    |
| 10   | 4B1  | 2F1  | -    | -    | 1F4    |
| 11   | 2A4  | 4B2  | -    | -    | -      |
| 12   | 5F3  | 5A3  | -    | _    | 2A5    |
| :    | :    | :    | :    | :    | :      |

Figure 6: Record of execution of basic blocks A, B and F being executed repeatedly and speculatively while maximally exploiting the global ILP across basic blocks. The numerical prefix on each operation is the outer loop iteration to which that particular operation belongs. The last operation of basic block A of the first iteration is executed at time 9. Prior to this, a number of operations from subsequent basic blocks of the first iteration, as well as operations from subsequent iterations, have been executed speculatively. Since there are seven integer operations per iteration of the outer loop, the integer units are the most heavily used resource. The best achievable performance corresponds to keeping these units completely utilized, which is in fact the case from cycle 4 onwards. This record of execution would occur if the flow of control always followed the expected path. But if, for instance, branch 1A5 branched to 1C (contrary to expectations) only those operations executed in cycles 1 through 9 that were originally from basic block 1A would have been useful.

At times, we might find that there is inadequate control parallelism to fully utilize the functional units. The second approach for finding more parallelism, termed speculative execution, relies on the fact that there is little to be lost in using otherwise idle resources to execute operations whose results we might, possibly, end up using. Although we may not yet be positive that the program will actually flow to certain subsequent blocks, we can still choose to speculatively execute operations from those blocks in parallel. Operations executed speculatively must be capable of being ignored if the flow does not later mandate their execution (and in real machines one must solve the difficult problem of treating error conditions raised by speculative operations). Thus, for example, stores to memory are rarely candidates for speculative execution.

Without knowing which way any branch actually does go, the transition frequencies for our example (Fig. 2B) show that the path from basic block A to B to F and back to A will be repeatedly executed multiple times. Thus a reasonable strategy is to use every cycle that would otherwise be wasted to execute instructions speculatively down this expected path. If we execute operations on this path as soon as their input operands are available, the resulting record of execution displays considerable parallelism (Fig. 6). When this is compared to the records of execution (Fig. 4) for the individual basic blocks A, B and F with no speculative execution, the improvement (assuming that branches always go in the expected direction) is almost a factor of four. In this case 3 or 4 operations are consistently started per cycle. This is by no means an atypical situation.

For the past twenty years, researchers have been gathering experimental data on the amount of available instruction-level parallelism for various workloads and under a variety of assumptions ([9, 10, 11, 12]). Although the results have varied with the experimental conditions, a consistent pattern has emerged: potential ILP, when limited to basic blocks, offers speedups of between a factor of 2 and 3.5. When techniques that move operations from block to block are used, the potential jumps to a factor of between 5 and 20, with even more speedup available in various applications such as array-oriented programs. By using estimated probabilities of branch direction, a system can selectively execute those operations which are most likely to be profitable. Experiments ([13, 14]) and experience have indicated that branch directions are sufficiently predictable to make speculative execution practical.

# 3 ILP Architectures

So far we have only demonstrated the end result of instruction-level parallelism, the record of execution, without having said much about how this is effected. How exactly are the necessary decisions made as to when an operation should be executed and whether an operation should be speculatively executed? The alternatives can be broken down depending on the extent to which these decisions are made by the compiler rather than by the hardware and on the manner in which information regarding parallelism is communicated by the compiler to the hardware via the program.

A computer architecture is a contract between the class of programs that are written for the architecture and the set of processor implementations of that architecture. Usually this contract is concerned with the instruction format and the interpretation of the bits that constitute an instruction, but in the case of ILP architectures it extends to information embedded in the program pertaining to the available parallelism. With this in mind, ILP architectures can be classified as follows.

- o Sequential architectures: architectures for which the program is not expected to convey any explicit information regarding parallelism. Superscalar processors ([2, 4, 6, 8, 15]) are representative of ILP processor implementations for sequential architectures.
- o Dependence architectures: architectures for which the program explicitly indicates the dependences that exist between operations. Dataflow processors ([16, 17, 18]) are representative of this class.
- o Independence architectures: architectures for which the program provides information as to which operations are independent of one another. Very Long Instruction Word (VLIW) ([5, 19, 20, 7]) processors are an example of the class of independence architectures.

These architectural alternatives are elaborated upon below and the three types of processors are discussed in the next section. Note that our list does not include vector architectures (and associated processors), although one could be defined corresponding to our assumed execution hardware (Fig. 1). Vector processors are not true ILP processors. They are best thought of as complex instruction set computers (CISC), that is, processors for a sequential architecture with (complex) vector instructions which possess a certain stylized parallelism internal to each vector instruction. However, despite the commercial success that they have had, vector processors are less general in their ability to exploit all forms of instruction-level parallelism due to their stylized approach to parallelism. Vector processors are treated in many textbooks, for instance, the one by Hwang and Briggs ([21]).

Sequential architectures. Since the compiler neither identifies parallelism nor makes scheduling decisions, the program contains no explicit information regarding the dependences between the instructions. If instruction-level parallelism is to be employed, the dependences that exist must be determined by the hardware, which must then make the scheduling decisions as well.

Dependence architectures. The compiler, or the writer of a program, identifies the parallelism in the program and communicates it to the hardware by specifying the dependences between operations. This is typically done by specifying, for each operation, the list of other operations that are dependent upon it. The hardware must make scheduling decisions at run-time which are consistent with the dependence constraints.

Independence architectures. The compiler identifies the parallelism in the program and communicates it to the hardware by specifying which operations are independent of one another. This information is of direct value to the hardware, since it knows with no further checking which operations it can execute in the same cycle. Unfortunately, for any given operation, the number of independent operations is far greater than the number of dependent ones and, so, it is impractical to specify all independences. Instead, for each operation, independences with only a subset of all independent operations (those operations that the compiler thinks are the best candidates to execute concurrently) are specified.

By listing operations that could be executed simultaneously, code for an independence architecture may be very close to the record of execution produced by an implementation of that architecture. If the architecture additionally requires that programs specify where (on which functional unit) and when (in which cycle) the operations are executed, then the hardware makes no run-time decisions and the code is virtually identical to the desired record of execution. The VLIW processors that have been built to date are of this type and represent the predominant examples of machines with independence architectures. A particular processor implementation of an independence architecture of this type could choose to disregard the scheduling decisions embedded in the program, making them at run-time instead. In doing so, the processor would still benefit from the independence information but would have to perform all of the scheduling tasks of a superscalar processor. Furthermore, when attempting to execute concurrently two operations that the program did not specify as being independent of each other, it must determine independence, just as a superscalar processor must.

The key distinguishing features of these three ILP architectures are summarized in Table 1. We now turn our attention to the types of processors built to implement these architectures.

| Table 1: A comparison of the instruction-level parallel architecture types discussed in this paper. |                                                                                           |                                                          |                                                                                                                                                   |  |  |  |  |
|-----------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|----------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|
|                                                                                                     | Sequential<br>Architecture                                                                | Dependence<br>Architecture                               | Independence<br>Architecture                                                                                                                      |  |  |  |  |
| Additional information required in the program                                                      | None                                                                                      | Complete specification of dependences between operations | Minimally, a partial list<br>of independences.<br>Typically, a complete<br>specification of when<br>and where each operation<br>is to be executed |  |  |  |  |
| Typical kind of ILP processor                                                                       | Superscalar                                                                               | Dataflow                                                 | VLIW                                                                                                                                              |  |  |  |  |
| Analysis of<br>dependences<br>between operations                                                    | Performed by hardware                                                                     | Performed by the compiler                                | Performed by the compiler                                                                                                                         |  |  |  |  |
| Analysis of independent operations                                                                  | Performed by hardware                                                                     | Performed by hardware                                    | Performed by the compiler                                                                                                                         |  |  |  |  |
| Final operation scheduling                                                                          | Performed by hardware                                                                     | Performed by hardware                                    | Performed by the compiler                                                                                                                         |  |  |  |  |
| Role of compiler                                                                                    | Rearranges the code to<br>make the analysis and<br>scheduling hardware<br>more successful | Replaces some analysis<br>hardware                       | Replaces virtually all<br>the analysis and<br>scheduling hardware                                                                                 |  |  |  |  |

# 4 ILP Processor Implementations

Superscalar processors. The goal of a superscalar processor is to execute many operations in parallel even though the hardware is handed a sequential program. But a sequential program is constructed with the assumption only that it will execute correctly when each operation waits for the previous one to finish, and that is the only order that the architecture guarantees to be correct. The first task, then, for a superscalar processor is to understand, for each instruction, which other instructions it actually is dependent upon. With every instruction that a superscalar processor issues, it must check whether the instruction's operands (registers or memory locations that the instruction uses or modifies) interfere with the operands of any other instruction that is either:

- o already in execution, or
- o has been issued but is waiting for the completion of interfering instructions that would have been executed earlier in a sequential execution of the program, or
- o is being issued concurrently but would have been executed earlier in a sequential execution of the program.

If none of these conditions is true, the instruction in question is independent of all these other instructions and it can be allowed to begin executing immediately. If not, it must be delayed until the instructions on which it is dependent have completed execution. In the meantime, the processor may begin execution of later instructions which prove to be independent of the ones that are being delayed. In addition, the superscalar processor must decide exactly when and on which available functional unit to execute the instruction. The IBM System/360 Model 91, built in the early 1960's, utilized a method called Tomasulo's Algorithm ([22]) to carry out these functions.

Note that a superscalar processor need not issue multiple operations per cycle in order to achieve a certain level of performance. For instance, in the case of our sample processor (Fig. 1), the same performance could be achieved by pipelining the functional units and instruction issue hardware five times as deeply, speeding up the clock rate by a factor of five but issuing only one instruction per cycle. This strategy has been termed **superpipelining** ([23]). Superpipelining may result in some parts of the processor (such as the instruction unit and communications busses) being less expensive and better utilized and other parts (such as the execution hardware) being more costly and less well used.

Dataflow processors. The objective of a dataflow processor is to execute an instruction at the earliest possible time subject only to the availability of the input operands and a functional unit upon which to execute the instruction. Unlike a superscalar processor, it counts on the program to provide information about the dependences between instructions. This is accomplished by including in each instruction a list of successor instructions. (An instruction is a successor of another instruction if it uses as one of its input operands the result of that other instruction.) Each time an instruction completes, it creates a copy of its result for each of its successor instructions. As soon as all of the input operands of an instruction are available, the hardware fetches the instruction, which specifies the operation to be performed and the list of successor instructions. The instruction is then executed as soon as a functional unit of the requisite type is available (Fig. 7).



Figure 7: An example of the execution of the computation graph in Figure 5 by a dataflow processor. The markings on arcs indicate inputs that are available. (A) The state of the processor when computation begins at time 1. D1, D2 and D5 have all their inputs available and begin execution. (B) The state at time 2. D2 and D5, both of which have latencies of one cycle, have completed, their results are available and D6 begins executing. (C) The state at time 3. D1, whose latency is two cycles, has completed and D3 begins execution with a latency of three cycles. Also, the branch, D6, has completed and the dataflow processor could begin a new iteration of D at this time, thereby achieving parallelism between successive iterations of D. (D) The state at time 6. D3 has completed and D4 can now begin. D4 and the entire computation complete 6 cycles from when the computation began.

This property, whereby the availability of the data triggers the fetching and execution of an instruction, is what gives rise to the name of this type of processor. Because of this property, it is redundant for the instruction to specify its input operands. Rather, the input operands specify the instruction! If there is always at least one instruction ready to execute on every functional unit, the dataflow processor achieves peak performance.

As we have seen, computation within a basic block typically does not provide adequate levels of parallelism. Thus superscalar and VLIW processors use control parallelism and speculative execution to keep the hardware fully utilized. Dataflow processors have traditionally counted on using control parallelism alone to fully utilize the functional units. A dataflow processor is more successful than the others at looking far down the execution path to find abundant control parallelism. When successful, this is a better strategy than speculative execution since every instruction executed is a useful one and the processor does not have to deal with error conditions raised by speculative operations.

Very Long Instruction Word (VLIW) processors. In order to execute operations in parallel, the system must determine that the operations are independent of one another. Superscalar processors and dataflow processors represent two ways of deriving this information at run-time. In the case of the dataflow processor, the explicitly provided dependence information is used to determine when an instruction may be executed so that it is independent of all other concurrently executing instructions. The superscalar processor must do the same but, since programs for it lack any explicit information, it must also first determine the dependences between instructions. In contrast, the program for a VLIW processor specifies exactly which functional unit each operation should be executed on and exactly when each operation should be issued so as to be independent of all operations that are being issued at the same time as well as of those that are in execution.

With a VLIW processor, it is important to distinguish between an instruction and an operation. An operation is a unit of computation, such as an addition, memory load or branch, which would be referred to as an instruction in the context of a sequential architecture. A VLIW instruction is the set of operations that are intended to be issued simultaneously. It is the task of the compiler to decide which operations should go into each instruction. This process is termed scheduling. Conceptually, the compiler schedules a program by emulating at compile-time what a dataflow processor, with the same execution hardware, would do at run-time. All operations that are supposed to begin at the same time are packaged into a single VLIW instruction. The order of the operations within the instruction specifies the functional unit on which each operation is to execute. A VLIW program is a transliteration of a desired record of execution which is feasible in the context of the given execution hardware.

The compiler for a VLIW machine specifies that an operation be executed speculatively merely by performing speculative code motion, that is, scheduling an operation before the branch that determines that it should, in fact, be executed. At run-time, the VLIW processor blindly executes this operation exactly as specified by the program just as it would for a non-speculative operation. Speculative execution is virtually transparent to the VLIW processor and requires little additional hardware. When the compiler decides to schedule an operation for speculative execution, it can arrange to leave behind enough of the state of the computation to assure correct results when the flow of the program requires that the operation be ignored. The hardware required for the support of speculative code motion consists of having some extra registers, of fetching some extra instructions, and of suppressing the generation of spurious error conditions. The VLIW compiler must perform many of the same functions that a superscalar processor performs at run-time to support speculative execution, but it does so at compile-time.

Other types of processors with independence architectures have been built or proposed. The superpipelined machine, described above, issues only one operation per cycle. But if there is no superscalar hardware devoted to preserving the correct execution order of operations, the compiler will have to schedule them with full knowledge of dependencies and latencies. From the compiler's point of view, these machines are virtually the same as VLIWs, though the hardware design of such a processor offers some tradeoffs with respect to VLIWs. Another proposed independence architecture, dubbed *Horizon* ([24]), encodes an integer *H* into each operation. The architecture guarantees that all of the past *H* operations in the instruction stream are data-independent of the current operation. All the hardware has to do to release an operation, then, is assure itself that no operation older than the *H*th previous operation is in flight or pending. The hardware does all of its own scheduling, unlike VLIWs and deeply pipelined machines which rely on the compiler, but the hardware is relieved of the task of determining data dependence.

# 5 ILP Compiler Techniques

Regardless of whether an ILP processor makes the final scheduling decisions at run-time or compile-time, a compiler must generate code in which operations are rearranged for better performance. In the case of VLIW processors, the operations are arranged precisely in their execution order. In the case of superscalar processors, the hardware can only consider some number of nearby operations (called the **instruction window**) at one time. Compilers, using their understanding of the processor's hardware scheduling algorithms, will attempt to produce a rearranged program loosely tailored for that hardware, giving the processor the greatest possible opportunity to find parallelism.

Various compiler techniques have been developed to schedule operations exactly or approximately. The simplest of these address a single basic block, and are often referred to as local scheduling. For example, a compiler might schedule each operation using a procedure similar to the following:

- 1. Use heuristics to pick one of the unscheduled operations whose predecessors have already been scheduled. Simple heuristics usually suffice.
- 2. Calculate the completion time for each of its predecessor operations by adding their execution latency to the time at which they are scheduled to begin execution. Take the largest of these completion times. This is the earliest time at which the operation can be scheduled to begin execution.
- 3. Determine the earliest time thereafter when an appropriate functional unit is available to perform the operation, that is, no previously scheduled operation has been assigned to that functional unit at the same time. Schedule the operation at this time on this functional unit.
- 4. If there are any remaining unscheduled nodes, repeat the procedure from step 1.

This procedure can be applied to the computation in basic block D to obtain a schedule (Fig. 4D). Note the similarity between the above procedure and the sequence of steps by which a dataflow processor would execute the same computation (Fig. 7).

If the compiler is generating code for a processor in which the hardware will make the scheduling decisions at run-time, it will not actually schedule the operations on a given functional unit for a given cycle. Instead, using its understanding of the processor's hardware scheduling algorithms, the compiler will attempt to produce a rearranged program tailored for that hardware, giving the processor the greatest possible opportunity to find parallelism. (The compiler, which has already found most of the same parallelism, discards it in producing a serial program.)

Global Scheduling Techniques. We have seen that most of the available parallelism is found beyond the boundaries of basic blocks. Until a decade ago, compiler techniques which scheduled ILP code by moving it from block to block (called global scheduling) were virtually nonexistent. But now two related sets of technologies have matured to the level that they are used in commercial environments. The first of these, which we may group under the heading software pipelining, schedules a loop so that successive iterations of the loop execute concurrently using code that is as compact as possible. The second set of techniques, which we may call trace directed scheduling, deal with scheduling computations with a more general flow of control, either where the loop structure is not relevant, or within a (possibly software pipelined) loop body with a complex flow of control.

Loop Parallelism and Software Pipelining. We begin by considering the loop around the basic block D, which is the most frequently executed basic block in the program. A record of execution of loop D would be ideal if it executed all operations as soon as their inputs and an appropriate functional unit were available (Fig. 8). This is what a dataflow processor would do. If we knew the exact trip count, that is, the number of iterations of the loop actually executed, we could achieve the same result as the dataflow processor by unrolling the loop completely, treating the resulting code as a single basic block and generating the best possible schedule. Unfortunately, we do not generally know the trip count at compile-time and, even if we did, the unrolled code would usually be too large for this to be a practical strategy. Our goal, therefore, is to approach the performance of this impractical approach in a practical way.

| TIME       | INT1 | INT2 | MEM1 | MEM2 | BRANCH |
|------------|------|------|------|------|--------|
| 1          | 1D2  | 1D5  | 1D1  | _    | _      |
| 2          | _    | -    | -    | _    | 1D6    |
| 3          | 1D3  | 2D5  | 2D1  | _    | -      |
| 4          | 2D2  | _    | _    | -    | 2D6    |
| 5          | 2D3  | 3D5  | 3D1  | _    | -      |
| 6          | 1D4  | 3D2  | -    | _    | 3D6    |
| 7          | 3D3  | 4D5  | 4D1  | _    | -      |
| 8          | 2D4  | 4D2  | _    | _    | 4D6    |
| 9          | 4D3  | 5D5  | 5D1  | _    | -      |
| 10         | 3D4  | 5D2  | _    |      | 5D6    |
| <b>!</b> : | :    | :    | :    |      | :      |

Figure 8: The record of execution of multiple iterations of D being executed overlapped with one another in a non-speculative fashion. Iteration 1 begins at time 1 and ends at time 6 with operation 1D4. In the meantime, iterations 2 and 3 have begun. Since there are four integer operations per iteration and two integer units, the average execution rate can be no better than one iteration every two cycles. The execution settles down into a pattern whereby an iteration is completed every two cycles from time 6 onwards. We see here that from iteration 3 onwards, every iteration has the same schedule. Consequently, each pair of cycles from time 6 onwards have exactly the same execution pattern. In general, this may not be the case.

To achieve this, imagine the following conceptual strategy. First, we unroll the loop completely. Then we schedule the code (Fig. 9A), but with two constraints:

- o all iterations have identical schedules except that
- o each iteration is scheduled some fixed number of cycles later than the previous iteration.

This fixed delay between the start of successive iterations is termed the initiation interval--two cycles in this case. This unrolled code, after scheduling, is repetitive except for a small portion at the beginning and, likewise, a small portion at the end. From cycle 5 onwards, the pattern of the record of execution repeats itself with a time period of 2 cycles (Fig. 9A). This repetitive portion can be re-rolled to yield a new loop which is known as the kernel (Fig. 9B). The prologue is the code corresponding to the record of execution that precedes the repetitive part and the epilogue is the code corresponding to the record of execution following the repetitive part. By executing the prologue, followed by the kernel an appropriate number of times, and finally the epilogue, one would come close to re-creating the ideal record of execution of the unrolled code (Fig. 8). Thus, a relatively small amount of code is able to approximate the ideal (but impractical) strategy of unrolling the loop completely. This technique for executing loops is known as software pipelining ([25, 26, 27, 28]).

Software pipelined code can be generated using an algorithm known as modulo scheduling ([25]). There are two steps to this: first, determining the initiation interval and, second, creating the schedule. The objective is to come up with a schedule having the smallest possible initiation interval, since this corresponds to the maximum performance. The limiting factor in deciding the initiation interval can either be a critical chain of dependences running through the loop iterations or a critical resource that is utilized fully. In the case of loop D, the integer units are the critical resources. Since there are three integer operations per iteration and only two integer units, the maximum sustained rate at which new iterations can be started is every two cycles, the smallest integer that is not less than 3/2. (By unrolling the loop once this could be changed to two iterations started every three cycles, thus using the machine to its maximum capacity.) Scheduling itself proceeds much like local scheduling except that in Step 3, resource conflicts must be avoided not only with operations from the same iteration but with operations from previous and subsequent iterations as well. This is done by ensuring that, within a single iteration, no machine resource is used at two points in time that are separated by a time interval that is a multiple of the iteration interval.

Compared to locally scheduled code, whose execution time is 21960 cycles (Fig. 3C), the execution time with a software pipelined loop D drops to 17520 cycles (Fig. 3D) for a further 26% increase in performance over local scheduling, yielding 1.05 operations per cycle.

| TIME | MOD   | INT1 | INT2 | MEM1     | MEM2 | BRANCH |  |  |
|------|-------|------|------|----------|------|--------|--|--|
|      |       |      |      |          |      |        |  |  |
| 1    | 1     | 1D5  | _    | 1D1      | -    | -      |  |  |
| 2    | 0     | 1D2  | -    | _        | _    | 1D6    |  |  |
| 3    | 1     | 2D5  | 1D3  | 2D1      | -    | -      |  |  |
| 4    | 0     | 2D2  | _    | _        | -    | 2D6    |  |  |
| 5    | 1     | 3D5  | 2D3  | 3D1      | -    | -      |  |  |
|      |       |      |      |          |      |        |  |  |
| 6    | 0     | 3D2  | 1D4  | -        | -    | 3D6    |  |  |
| 7    | 1     | 4D5  | 3D3  | 4D1      | -    | -      |  |  |
| 8    | 0     | 4D2  | 2D4  | _        | _    | 4D6    |  |  |
| :    |       | :    | :    | :        |      | •      |  |  |
| 19   | 1     | 10D5 | 9D3  | 10D1     | -    | -      |  |  |
| 20   | 0     | 10D2 | 8D4  | _        | -    | 10D6   |  |  |
| 21   | 1     | -    | 10D3 | -        | _    | -      |  |  |
| 22   | 0     | -    | 9D4  | -        | -    | -      |  |  |
| 23   | 1     | -    | -    | -        | _    | -      |  |  |
| 24   | 0     | -    | 10D4 | _        | -    | -      |  |  |
| 25   | 1     | _    | -    | -        | -    | -      |  |  |
|      |       |      |      | <b>A</b> |      |        |  |  |
|      |       |      |      |          |      |        |  |  |
|      | BLOCK | INT1 | INT2 | MEM1     | MEM2 | BRANCH |  |  |
|      | c:    | _    | C2   | _        | _    | _      |  |  |
| ļ    | · ·   | 1D5  | -    | 1D1      | _    | _      |  |  |
| 1    |       | 1D2  | _    |          | _    | 1D6    |  |  |
|      |       | 2D5  | 1D3  | 2D1      | _    |        |  |  |
|      |       | 2D2  | C1   | _        | _    | 2D6    |  |  |
| 1    | D:    | 3D5  | 2D3  | 3D1      | -    | _      |  |  |
| ł    | •     | 3D2  | 1D4  | _        | _    | 3D6    |  |  |
|      | E:    | _    | 3D3  | _        | _    | -      |  |  |
| 1    | •     | _    | 2D4  | -        | _    | -      |  |  |
|      |       | _    | _    | _        | _    | -      |  |  |
| ŀ    |       | _    | 3D4  | _        | -    | E1     |  |  |
| В    |       |      |      |          |      |        |  |  |

Figure 9: (A) Record of execution of multiple copies of basic block D being executed in a software pipeline. The schedules for all iterations are identical except that each one starts two cycles later than the previous one. From cycle 5 through cycle 20, there is a repeating pattern of execution with a period of two cycles. This is caused by repeatedly executing the kernel. Cycles 1 through 4, and cycles 21 through 25, are the result of executing the prologue and epilogue, respectively. (B) VLIW software pipelined code for loop D. The basic block D, after modulo scheduling, consists of only the kernel of the software pipelined loop. The prologue is combined with basic block C and the epilogue is combined with basic block E. Since 1D1 in the prologue is dependent on C2, C2 must be scheduled one cycle earlier that the start of the prologue. On the other hand, C1 can be scheduled as late as the last instruction in the prologue since it is the predecessor of 1D4. E1, the jump back to basic block F, must be scheduled in the last instruction of E, that is, at the end of the epilogue. Good superscalar code would result from a linear ordering of the VLIW code obtained by a left-to-right, top-to-bottom scan of the VLIW code.

Trace Scheduling. Intuitively, one might approach the problem of scheduling operations globally as follows: first, schedule each basic block; second, move operations about from block to block to improve the schedule. Unfortunately, this will cause too many decisions, such as which registers and functional units to use for each operation, to be made with a local perspective, causing many false conflicts from a global viewpoint and greatly limiting the quality of the schedule. Instead, trace scheduling ([29, 30]) works as follows:

1. The scheduler selects a trace, that is, a linear sequence of basic blocks. Frequency profiles are used to prune this selection to the most frequently executed code not yet scheduled.

By using loop unrolling and other techniques, large bodies of code can be considered simultaneously. For instance, the outer loop of the sample program could be unrolled once, producing a new copy of each block in the example program (Fig. 10A). Thus the program now consists of blocks 1A-1B-1C-1D-1E-1F-2A-2B-2C-2D-2E-2F. By using the execution profile (Fig. 2B), the compiler notes that the loop A-B-F is executed repeatedly, and it picks the trace 1A-1B-1F-2A-2B-2F as the section of code to schedule. (In practice, such a critical section of code might be unrolled a lot more, though the compiler must be mindful of how much code space is being used in doing so.)

2. The entire trace is considered at once. The compiler treats the trace as if it were a single basic block and schedules it in a manner similar to the local scheduling described above. During this process, branches are given no special consideration except that their order is not altered. Register allocation, functional unit selection, and so on, are done only as an operation is being scheduled.

In our example, the 22 operations from 1A-1B-1F-2A-2B-2F are considered together, and are placed into one DAG (Fig. 10B). In general, the compiler will build extra edges in this DAG to prevent the speculative execution of operations that would be illegal because they would have permanent effects.

3. After scheduling the trace, the compiler must under some circumstances duplicate operations that have been scheduled. This occurs for two reasons: first, some operations will have been scheduled after a conditional jump that they used to precede. In that case, the operation in question must be duplicated so that it appears in the off-trace target of the branch. Suppose in our example that an operation in block 1A were scheduled below 1A5. Then a new copy of that operation would have to be placed before the entrance to block 1C for subsequent scheduling when 1C is part of the picked trace.

Second, when there are rejoins, such as the one at the top of block 1F, the rejoin must jump to a place after which only operations that were below the original rejoin may be found. For instance, the rejoin at the top of 1F must be no earlier than instruction 9 since instruction 8 contains 1A5. The highest such place in the schedule formed by the compiler may be below the scheduled location of some of the operations that originally were after the rejoin. Unless certain conditions exist, these operations must be copied to the end of the off-trace block that is jumping to the new rejoin location. The rejoin at the top of 1F comes after 1F2 which must, therefore, be copied to the end of 1E.

In practice, this extra code has been relatively small, but, as we mentioned before, avoiding the generation of too much code when the flow of control is very complex can be a concern when scheduling for VLIWs and superscalar processors.

The result of this process is trace scheduled code (Fig. 11). With trace scheduling (and the previously applied software pipelining), our example executes in 8500 cycles, yielding 2.16 operations per cycle (Fig. 3E). This translates to a factor of 2.6 improvement over local ILP, and a 3.5 factor improvement over serial execution. In the procedure outlined above, we restricted ourselves to selecting a set of basic blocks that is a **trace**, that is, a linear path through the code. In many cases, it might be desirable to select a set of blocks that represent a more general flow of control. Extensions of the above scheduling procedure can handle this more general case ([31, 32]).

|              | Basi | c                     |
|--------------|------|-----------------------|
| Operation    | Bloc | k                     |
| <u>Label</u> | Labe | 1 Statement           |
| 1A1          | 1A:  | T1 = LOAD X(I-1)      |
| 1A2          |      | T2 = LOAD X(I)        |
| 1A3          |      | T3 = 2 * T1           |
| 1A4          |      | T4 = T3 - T2          |
| 1A5          |      | IF T4 > 0 GOTO 1C     |
| 1B1          | 1B:  | T5 = T2 - T1          |
| 1B2          |      | U1 = 2 * T5           |
| 1F1          | 1F:  | T8 = 3 * V            |
| 1F2          |      | V = T8 + U1           |
| 1F3          |      | I = I + 1             |
| 1F4          |      | IF I > 1000 GOTO EXIT |
| 2A1          | 2A:  | T11 = LOAD X(I-1)     |
| 2A2          |      | T12 = LOAD X(I)       |
| 2A3          |      | T13 = 2 * T11         |
| 2A4          |      | T14 = T13 - T12       |
| 2A5          |      | IF T14 > 0 GOTO 2C    |
| 2B1          | 2B:  | T15 = T12 - T11       |
| 2B2          |      | U2 = 2 * T15          |
| 2F1          | 2F:  | T18 = 3 * V           |
| 2F2          |      | V = T18 + U2          |
| 2F3          |      | I = I + 1             |
| 2F4          |      | IF I ≤ 1000 GOTO 1A   |
|              | G:   |                       |
|              |      | ${f A}$               |

Figure 10: (A) Code for the unrolled trace. The body of the outer loop now contains two copies each of every basic block in the original loop body. The labels of the two copies of a basic block are distinguished by a prefix of either 1 or 2. Likewise, the labels of the two copies of each operation are distinguished by a similar prefix. The temporary variables that hold the results of the second set of operations have been renamed in a systematic fashion. For instance, the temporary variable T1 has been renamed T11 in basic block 2A. Since the expectation is that the trace will tend not to be exited, the sense of branches 1A5, 1F4 and 2A5 has been reversed. The trace includes only the basic blocks 1A, 1B, 1F, 2A, 2B and 2F. There also are two copies, 1D and 2D, of the software pipelined inner loop as well as basic blocks 1C, 1E, 2C and 2E which are not shown.



Figure 10: (B) The computation graph for the code in the trace consisting of basic blocks 1A, 1B, 1F, 2A, 2B and 2F. The edges in the graph represent the dependences that exist between operations. The numbers on the edges are the latencies of the operations from which the edges emanate. (All edges flow from the higher end of the edge to the lower end).

| E  | PLOCK | INT1 | INT2 | MEM1 | MEM2       | BRANCH |
|----|-------|------|------|------|------------|--------|
| 1  | 1A:   | 1F1  | 1F3  | 1A2  | 1A1        | -      |
| 2  |       | -    | -    | 2A2  | 2A1        | -      |
| 3  |       | 1A3  | 1B1  | _    | _          | -      |
| 4  |       | 2A3  | 1B2  | _    | -          | -      |
| 5  |       | _    | 2B1  | _    | _          | _      |
| 6  |       | _    | 2B2  | _    | _          | -      |
| 7  |       | 1F2  | 1A4  | -    | _          | _      |
| 8  |       | 2A4  | 2F1  |      | <b>-</b> , | 1A5    |
| 9  | 1F:   | 2F3  | -    | -    | -          | 1F4    |
| 10 | 2A:   | -    | -    | -    | -          | 2A5    |
| 11 | 2F:   | 2F2  | _    | -    | -          | 2F4    |
| 12 | G:    | _    |      | _    |            | -      |

Figure 11: Scheduled code (which is also the record of execution) for the unrolled trace in Fig. 10. The operations in the trace are scheduled as if the trace is a single basic block, even though it actually consists of 6 basic blocks. This results in operations freely moving above or below branches determined only by what yields a good schedule. The relative ordering of branches is not altered. After scheduling, the demarcation between blocks can be re-established to determine which operations have moved from one basic block to another. Each branch defines the end of its basic block. Thus 1A ends at time 8 since 1A5 is scheduled that time. Likewise, 1F, 2A and 2F end at times 9, 10 and 11, respectively. The boundary between basic blocks 1B and 1F and between 2B and 2F are defined by the points at which the branches from 1E and 2E, respectively, enter the trace. The rule used here is that the rejoin after trace scheduling should be at the earliest point that does not include operations that were originally above the rejoin. The earliest instruction, that does not include any operations that originally were from 1A or 1B, is instruction 9. Since instruction 8 is part of 1A and instruction 9 is in 1F, this means that 1B is now an empty block. Likewise, 2B is an empty block. Once the basic blocks have been demarcated, it is clear what code motion has been effected. For instance, 2F1 has been speculatively moved up by five blocks, past three conditional branches and two rejoins, into 1A. In general, if an operation is moved up past a rejoin, it must be copied into the off-trace code just prior to the rejoin. The rejoin at the top of 1F comes after 1F2 which must, therefore, be copied to the end of 1E. Likewise, code that has moved down past a conditional branch must be copied into the off-trace code immediately following the exit. Once again, good superscalar code would result from a linear ordering of the VLIW code obtained by a left-to-right, top-to-bottom scan of the VLIW code.

Predicated Execution: As we saw earlier, in certain cases there exists an alternative to this speculative approach to exploiting inter-block parallelism. This arises when there are distinct computations or portions of the program that can be executed in parallel without having to speculate. A dataflow processor (with multiple loci of control) is well suited to exploiting such parallelism; each locus of control executes a separate portion of the computation. Since each locus

of control is independently and concurrently executing branches, the number of ways in which the overall computation can evolve is combinatorial in nature. This poses a problem for a VLIW or superscalar processor that is trying to emulate the record of execution of the dataflow machine, handicapped as it is by having only a single locus of control; a distinct code sequence must exist for each possible record of execution that can result on the dataflow machine, often leading to an intolerable amount of code.

Predicated execution is a mechanism that allows a uniprocessor to more efficiently emulate the dataflow processor. To begin with, all branches are eliminated in each of the code regions of interest and each operation is provided, as its predicate input, with a boolean value that is true if and only if flow of control would have passed through this operation in the original code. This process is termed IF-conversion. Given that all branches have been eliminated in the regions of interest, these regions can be scheduled to execute in parallel with no combinatorial problems of code size. During execution, the selective suppression of operations by the predicates yields the various combinations of execution records that would have been generated by a multiple locus machine. However, there are two drawbacks to this approach: first, the selective suppression of the predicated operations results in wasted execution cycles. Depending on the nature of the computation, the number of such wasted cycles may either be greater than or less than the number wasted during speculative execution. Second, the different paths through the original code may have different lengths. The IF-converted code must take as long as the longest path even when a shorter path is to be executed and, if this computation is on the critical path, performance is degraded.

Predicated execution is often beneficial when executing loops which have branches in the body of the loop. The trace A-B-F through the outer loop of the sample code is such an example since it contains the branch A5. The successive iterations of the trace are the computations that it is desired be executed in parallel. After IF-conversion, the operations in the trace can be software pipelined in much the same way that loop D was. This results in an overall execution time of 7008 cycles, yielding an average of 2.61 operations per cycle (Fig. 3F). This constitutes more than a fourfold speedup over the serial execution of the program. The software pipelining of a trace through the outer loop also illustrates the point that trace scheduling and software pipelining can be applied either singly, as alternatives to one another, or in conjunction with one another.

# 6 An Evaluation of the Alternatives

ILP is an extremely controversial field of computer science. Most researchers agree on the desirability and practicality of ILP and sophisticated compiling, but the consensus stops there. All three types of implementations discussed above have their advocates, as do various blends of those technologies. In this section we discuss briefly the arguments that have been advanced for and against the various alternatives.

Object Code Compatibility. Processors that can each run the same object code, unchanged, are said to be object-code compatible. Object-code compatibility is relevant to the design of ILP processors for two reasons: it is an advantage for a new family of ILP processors to run the large body of code that has formed around an existing family of sequential processors, and it is an advantage if all members of a new family of ILP processors offered by a manufacturer can share the same object code. In the presence of object-code compatibility only one copy of object code need be verified for correctness by a software producer, and only one copy need be distributed. Upgrading can be as simple as removing the old processor and plugging in a new one.

## Superscalar advocates' viewpoint:

This issue is one of the major incentives for the design of superscalar machines. Superscalars offer object code compatibility in both senses, since they run code written for existing sequential machines. In contrast, a VLIW processor expects code which is cognizant of the number of functional units and their latencies, that is, code which has been compiled expressly for that processor. VLIW code may not work correctly, or may lose performance, when moved to a new VLIW processor with different relative latencies among the functional units. It thus may have to be recompiled, losing object code compatibility. While a VLIW designer may add a small amount of superscalar hardware to solve the correctness problem, recompilation is the only practical alternative from a performance standpoint when relative latencies change.

#### VLIW advocates' viewpoint:

While this problem has been a large barrier to manufacturers wishing to offer VLIWs, there are solutions that can mitigate the situation. For example, future VLIWs can be built to be more flexible with respect to latencies, at least assuring correct interpretation of old code (though perhaps with lowered performance). Also, in future systems, it may be possible to relax strict object-code compatibility. For example, the paradigm of software distribution can change to one with a greater emphasis on a language level representation that is between the machine language and source language levels.

In addition, superscalars will soon approach the limits of what may be done in hardware (see argument below). When they do, we may find that superscalar processors are not quite as free of the object code compatibility problem as they initially seem to be. Compilers for any ILP processor benefit from a knowledge of the ILP aspects of the processor. Old code on a new superscalar processor, with very different latencies or numbers of functional units, may get little performance benefit from the new machine until it is recompiled. Once recompiled, it may run slower on the thousands of old machines, causing a dilemma for software distributors. Furthermore, how is a superscalar processor to handle aggressive speculative execution? If speculative code motions must be done in the compiler, they will be missing from existing code, and the new code will run more slowly on the old machines, due to the execution of discarded speculative operations. It is unclear whether the same object code can run at acceptable performance levels on a broad family of superscalar processors that offer the amount of ILP that was available on the VLIW processors built in the 1980's. This is an extremely important, open research question.

The Practicality of Smart Hardware. Superscalar processors analyze dependences and schedule operations in hardware as the code runs. VLIW processors carry out the same process in the compiler beforehand, leading to less hardware but longer compile times.

#### VLIW advocates' viewpoint:

A major advantage of VLIW processors is that, from a hardware standpoint, they are far simpler and far cheaper to build than are superscalar processors. Although small levels of ILP might be practical on a superscalar processor, the determination of independence can be too expensive to do in hardware if reasonably high levels of ILP are desired. Even with the execution hardware used in this paper (Fig. 1), when one is attempting to issue five operations per cycle, there can be as many as eleven operations being executed at any given time. It would be necessary to perform over 250 comparisons each cycle to check for operand interference. This work is of little consequence when done once per operation at compile-time, but is on the critical path of instruction issue when done at run-time for each operation executed.

Speculative execution performed at run-time can be costly as well. Whenever the processor comes to a point where it does not yet know which way to branch, it must decide which path, or paths, to follow, save enough of the processor state (in case it later needs to ignore the speculative operations) and then begin executing the speculative code. In contrast, when done at compile-time (as with a VLIW) this is a normal part of the compiling process. VLIW advocates argue that these constraints will limit the ILP that a superscalar can exploit to far less than is available, since the hardware quickly approaches the practical limits of what one can build. The compiler for a superscalar processor, too, could perform the code motions necessary to yield speculative execution. The superscalar machine would then only need to do the final scheduling of the speculative operation just as it would for any other operation. While this is entirely practical, it implies the creation of new object code, thereby discarding many of the hoped for advantages of superscalar machines.

## Superscalar advocates' viewpoint:

The additional hardware is costly, but it is necessary because object-code compatibility must be maintained. Furthermore, there is little evidence that there is enough ILP to keep busy even the execution hardware used in this paper, except in certain compute-intensive, scientific array-oriented codes. If one wants to speed up the latter codes, one can use vector processing. While that does introduce incompatible operations to an architecture, they are few and relatively standard operations. In addition, compiling for a VLIW may be itself compute intensive and unacceptable to users, though it is unknown whether compile time varies with the architectural type as much as with the degree of ILP being extracted.

In addition, superscalar advocates point out that at run-time there is more information available. For example, references to memory often involve locations whose addresses are computed as the program runs. A compiler for a VLIW sometimes cannot know whether two such references are the same, and must sometimes limit the ILP it exploits because of that missing knowledge (this is called the **memory disambiguation problem**). At run-time, the hardware can simply examine addresses and know exactly to which locations they refer.

Finally, clever techniques may be developed which will greatly reduce the amount of hardware and time required to do scheduling and analysis at run-time.

The Merits of Dataflow Processors. Dataflow architectures have held great promise for over two decades. The dataflow model of computation is also an important unifying concept in ILP: it serves as an idealized framework and reference standard for ILP. The dataflow debate revolves around its potential for massive levels of ILP with excellent object code compatibility, its shortcomings in supporting programs written in conventional languages or those with little parallelism, and the practicality of its synchronization hardware.

#### Dataflow advocates' viewpoint:

Superscalar and VLIW processors cannot match the ability of a dataflow machine to exploit global control parallelism, because with a single locus of control they must attempt to emulate the vast number of ways in which the computation can evolve on the dataflow machine. In principal, a compiler could create a code sequence for every eventuality, but this would produce programs with far too much object code. Superscalars and VLIWs can use predicated execution to get around this problem. While this has proven to be worthwhile in practice, it is only a pale imitation of what a dataflow machine can do. In fact, dataflow processors, or some variation thereof, are probably the only candidates for exploiting massive levels of ILP.

By virtue of the dependence information embedded in the program, dataflow processors have eliminated the fundamentally sequential process of extracting that information from a sequential program. At the same time, by virtue of the synchronization hardware, the execution of a program can be adjusted to reflect the number of functional units and their latencies. The dataflow architecture is well equipped to provide compatible execution of programs, at high performance, across a broad range of processors without recompilation.

Admittedly, the amount of synchronization overhead is excessive in the fine-grain versions of dataflow. The solution is to adopt a coarser-grain model in which the scheduled units of computation, consisting of multiple operations, may each be executed using a superscalar, a VLIW or even a sequential processor.

Although languages like Fortran and C are currently the most important languages, they are inherently flawed when it comes to expressing large amounts of instruction-level parallelism. If the potential of highly parallel ILP processors is to be demonstrated, it will have to be via functional languages. Once the potential has been demonstrated, the incentive will exist for new programs to be written in functional languages. Only in this manner can highly parallel processing become a reality.

## Dataflow sceptics' viewpoint:

Ignoring imperative languages, such as Fortran and C, immediately removes dataflow from being a viable candidate for wide-spread usage in the foreseeable future. On the other hand, if an attempt is made to support imperative languages, existing dataflow models will be unable to cope with the unstructured control flow and updateable memory of these languages. Extending the dataflow model to handle this will entail the introduction of large numbers of additional dependence arcs, the enforcement of which will generate even more synchronization overhead.

This is a problem since dataflow processors already perform far too many overhead operations including, on the average, two synchronizations and two result messages per operation executed as well as numerous copy operations at each split or merge in the control flow graph. Although some of the tasks that a superscalar processor must perform are not required of a dataflow processor, it still entails too much hardware dedicated to run-time synchronization and scheduling.

Dataflow processors have been designed on the assumption that there will always be enough parallelism available in the computation to cover up any latency resulting from operation execution and data transmission. As a result, little importance has been attached to incorporating into the dataflow processor even features as basic as data caches. For programs with little inherent parallelism, this, along with the relatively long latencies through the dataflow processor's interconnection network, could lead to poorer performance on a dataflow processor than on a conventional uniprocessor.

Software Pipelining vs. Loop Unrolling: The compiler techniques that we have described apply equally to VLIW and superscalar implementations. However, there are many open questions in the areas of scheduling and code generation. For example, we have presented two different approaches to scheduling loop codes: software pipelining and the trace scheduling of unrolled loops. There are examples of code where one or the other is clearly superior, but little is known about where the boundary lies. At other times, software pipelining and trace scheduling are complementary and work well in concert. Both sets of techniques have been implemented in commercial processors ([19, 20]), but neither implementation was a suitable testbed on which these issues could be explored.

#### 7 Future Work

In the past three years, ILP has gone from being a thirty year old field of research with a handful of papers produced each year, to being one of the two or three dominant areas in the field of computer architecture. As is obvious from the disagreements expressed in the previous section, research in progress includes questions central to the future design and use of ILP systems. Much of this research involves the quantitative evaluation of the alternatives presented above, and much work remains in the invention and prototyping of new techniques to exploit ILP.

One area of investigation involves measuring how much parallelism is available in ordinary programs. Many of the disagreements explored above may boil down to a question of how much ILP is available in ordinary programs. But, in fact, there is little agreement about what workloads matter. Often, depending upon their backgrounds, researchers have very different views about what types of programs are important. The amount of ILP available varies dramatically with the the choice of programs being measured. If relatively little ILP is available, perhaps a factor of 2-3, then the arguments in favor of superscalar architectures may become overwhelming. If instead the typical available improvement is in the 5-20 range, then the objections to VLIWs may be small compared to the difficulty of building a superscalar to exploit that much ILP. Finally, if massive quantities of ILP are typically available, dataflow may turn out to be the only architecture that can exploit it. In the authors' opinion, most of the debates about ILP will remain hollow until the amount of parallelism available in ordinary programs is quantified.

Another important area of research is the design of hybrid architectures and processors which distill the good properties of each of the above classifications while overcoming the shortcomings. The Horizon processor is a good example of such a compromise. Other important questions remain in the design of processors, such as whether to issue multiple operations per cycle or whether to design machines which are superpipelined.

ILP is an area that has had a relatively real-world orientation. There have been several commercial implementations of ILP processors which have been quite novel and ambitious, but which have nevertheless only scratched the surface of what is possible. Much engineering and research remains outstanding in the areas of register allocation, handling general flow of control, the use of predicates, and so forth. Studies have shown that the bulk of ILP is accessible only when one is willing to do large amounts of speculative execution. Yet, only a few systems have begun to wrestle with the practical implications of speculatively triggering error conditions, which might or might not be spurious. Also, none of the systems built to date have seriously tried to extend ILP to nonscientific codes, where the memory disambiguation problem becomes much more difficult.

There is little argument about the desirability of allowing software to arrange code for more ILP. But whether a large increment over what can be done today is desirable, possible, or practical are controversial questions. Whether a significant amount of this work can realistically be done in the hardware is also an area with more opinions than facts.

## References

- 1. P. P. Gelsinger, et al. Microprocessors circa 2000. IEEE Spectrum (1989), 43-47.
- 2. J. E. Thornton. Parallel Operations in the Control Data 6600. <u>Proc. AFIPS Conference</u> (1964), 33-40.
- 3. W. Buchholz, ed. <u>Planning A Computer System: Project Stretch</u>. (McGraw-Hill, New York, 1962).
- 4. D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo. The System/360 Model 91: Machine Philosophy and Instruction Handling. <u>IBM Journal of Research and Development</u> 11, (1967), 8-24.
- 5. A. E. Charlesworth. An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS-164 Family. <u>IEEE Computer</u> 14, 9 (1981), 18-27.
- 6. <u>80960CA User's Manual</u>. 1989, Intel Corporation: Santa Clara, California.
- 7. <u>i860 64-Bit Microprocessor Programmer's Reference Manual</u>. 1989, Intel Corporation: Santa Clara, California.
- 8. (Multiple articles). IBM Journal of Research and Development 34, 1 (1990).
- 9. E. M. Riseman and C. C. Foster. The inhibition of potential parallelism by conditional jumps. IEEE Transactions on Computers 21, 12 (1972), 1405-1411.
- 10. C. C. Foster and E. M. Riseman. Percolation of code to enhance parallel dispatching and execution. <u>IEEE Transactions on Computers</u> 21, 12 (1972), 1411-1415.
- 11. A. Nicolau and J. A. Fisher. Using an oracle to measure parallelism in single instruction stream programs. Proc. Fourteenth Annual Microprogramming Workshop (1981), 171-182.
- 12. D. W. Wall. Limits of instruction-level parallelism. <u>Proc. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems</u> (1991), 176-188.
- 13. S. McFarling and J. Hennessy. Reducing the cost of branches. <u>Proc. Thirteenth International Symposium on Computer Architecture</u> (Tokyo, Japan, 1986), 396-403.
- 14. J. A. Fisher and S. Freudenberger, <u>The static jump predictability of programs+data for instruction-level parallelism</u>. to be published 1992, Hewlett-Packard Laboratories: Palo Alto, California.
- 15. M. Johnson. <u>Superscalar Microprocessor Design</u>. (Prentice-Hall, Englewood Cliffs, New Jersey, 1991).
- 16. Arvind and V. Kathail. A multiple processor dataflow machine that supports generalised procedures. Proc. Eighth Annual Symposium on Computer Architecture (1981), 291-302.

- 17. Arvind and K. Gostelow. The U-interpreter. IEEE Computer 15, 2 (1982).
- 18. J. Gurd, C. C. Kirkham, and I. Watson. The Manchester Prototype Dataflow Computer. Communications of the ACM 28, 1 (1985), 34-52.
- 19. R. P. Colwell, et al. A VLIW Architecture for a Trace Scheduling Compiler. Proc. Second International Conference on Architectural Support for Programming Languages and Operating Systems (Palo Alto, California, 1987), 180-192.
- 20. B. R. Rau, et al. The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions and Trade-offs. IEEE Computer 22, 1 (1989).
- 21. K. Hwang and F. A. Briggs. Computer Architecture and Parallel Processing. (McGraw-Hill, New York, 1984).
- 22. R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. <u>IBM Journal</u> of Research and Development 11, (1967), 25-33.
- 23. N. P. Jouppi. The Nonuniform Distribution of Instruction-Level and Machine Parallelism and Its Effect on Performance. <u>IEEE Transactions on Computers</u> 38, 12 (1989), 1645-1658.
- 24. M. R. Thistle and B. J. Smith. A processor architecture for Horizon. <u>Proc. Supercomputing</u> '88 (Orlando, Florida, 1988), 35-41.
- 25. B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily scheduleable horizontal architecture for high performance scientific computing. <u>Proc. Fourteenth Annual Workshop on Microprogramming</u> (1981), 183-198.
- 26. P. Y. T. Hsu, <u>Highly Concurrent Scalar Processing</u>. Coordinated Science Lab., University of Illinois: Urbana, Illinois, 1986.
- 27. M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. <u>Proc. ACM SIGPLAN '88 Conference on Programming Language Design and Implementation</u> (1988), 318-327.
- 28. J. C. Dehnert, P. Y.-T. Hsu, and J. P. Bratt. Overlapped Loop Support in the Cydra 5. <u>Proc. Third International Conference on Architectural Support for Programming Languages and Operating Systems</u> (Boston, Mass., 1989), 26-38.
- 29. J. A. Fisher. Trace scheduling: A technique for global microcode compaction. <u>IEEE Transactions on Computers</u> 30, 7 (1981).
- 30. J. R. Ellis. <u>Bulldog: A Compiler for VLIW Architectures</u>. (The MIT Press, Cambridge, Mass., 1985).
- 31. J. L. Lynn. SRDAG compaction--a generalization of trace scheduling to increase the use of global context information. <u>Proc. Sixteenth Annual Workshop on Microprogramming</u> (1983), 11-22.
- 32. J. A. Fisher, <u>Trace Scheduling-2</u>, an extension of trace scheduling, to be published 1992, Hewlett-Packard Laboratories: Palo Alto, California.